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Abstract 

^ Statistical models with latent structure have a history going back to the 1950s and have 

seen widespread use in the social sciences and, more recently, in computational biology and 
^ in machine learning. Here we study the basic latent class model proposed originally by the 

^ t/^^ sociologist Paul F. Lazarfeld for categorical variables, and we explain its geometric structure. 

We draw parallels between the statistical and geometric properties of latent class models and 

1— I we illustrate geometrically the causes of many problems associated with maximum likelihood 

J> estimation and related statistical inference. In particular, we focus on issues of non-identifiability 

and determination of the model dimension, of maximization of the likelihood function and on 
the effect of symmetric data. We illustrate these phenomena with a variety of synthetic and 

^ real-life tables, of different dimension and complexity. Much of the motivation for this work 

stems from the "100 Swiss Francs" problem, which we introduce and describe in detail. 
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1 Introduction 



Latent class (LC) or latent structure analysis models were introduced in the 1950s in the social 
science literature to model the distribution of dichotomous attributes based on a survey sample 
from a populations of individuals organized into distinct homogeneous classes on the basis of an 
unobservable attitudinal feature. See Anderson (1954), Gibson (1955), Madansky (1960) and, in 
particular, Henry and Lazarfeld (1968). These models were later generalized in Goodman (1974), 
Haberman (1974), Clogg and Goodman (1984) as models for the joint marginal distribution of a set 
of manifest categorical variables, assumed to be conditionally independent given an unobservable 
or latent categorical variable, building upon the then recently developed literature on log-linear 
models for contingency tables. More recently, latent class models have been described and studied 
as a special cases of a larger class of directed acyclic graphical models with hidden nodes, sometimes 
referred to as Bayes nets, Bayesian networks, or causal models, e.g., see Lauritzen (1996), Cowell et 
al. (1999), Humphreys and Titterington (2003) and, in particular, Geiger et al. (2001). A number 
of recent papers have established fundamental connections between the statistical properties of 
latent class models and their algebraic and geometric features, e.g., see Settimi and Smith (1998, 
2005), Smith and Croft (2003), Rusakov and Geiger (2005),Watanabe (2001) and Garcia et al. 
(2005). 

Despite these recent important theoretical advances, the basic statistical tasks of estimation, 
hypothesis testing and model selection remain surprisingly difficult and, in some cases, infeasible 
tasks, even for small latent class models. Nonetheless, LC models are widely used and there is a 
"folklore" associated with estimation in various computer packages implementing algorithms such 
as EM for estimation purposes, e.g., see Uebersax (2006a,b). 

The goal of this article is two-fold. First, we offer a simplified geometric and algebraic de- 
scription of LC models and draw parallels between their statistical and geometric properties. The 
geometric framework enjoys notable advantages over the traditional statistical representation and, 
in particular, offers natural ways of representing singularities and non-identifiability problems. Fur- 
thermore, we argue that the many statistical issues encountered in fitting and interpreting LC mod- 
els are a reflection of complex geometric attributes of the associated set of probability distributions. 
Second, we illustrate with examples, most of which quite small and seemingly trivial, some of the 
computational, statistical and geometric challenges that LC models pose. In particular, we focus 
on issues of non-identifiability and determination of the model dimension, of maximization of the 
likelihood function and on the effect of symmetric data. We also show how to use symbolic soft- 
ware from computational algebra to obtain a more convenient and simpler parametrization and for 
unravelling the geometric features of LC models. These strategies and methods should carry over 
to more complex latent structure models, such as in Bandeen-Roche et al. (1997). 

In the next section, we describe the basic latent class model and introduce its statistical proper- 
ties and issues, and we follow that, in Section 3, with a discussion of the geometry of the models. In 
Section 4, we turn to our examples exemplifying identifiability issues and the complexity of the like- 
lihood function, with a novel focus on the problems arising from symmetries in the data. Finally, we 
present some computational results for two real-life examples, of small and very large dimension, 
and remark on the occurrence of singularities in the observed Fisher information matrix. 
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2 Latent Class Models for Contingency Tables 



Consider k categorical variables, Xi, . . . where each Xi takes value on the finite set [di] = 
{1, . . . ,di}. Letting V = ^\^i[di\, MP is the vector space of of /c-dimensional arrays of the for- 
mat di X . . . X dk, with a total of d = Hi '^i entries. The cross-classification of N independent 
and identically distributed realizations of {Xi, . . . ,Xk) produces a random integer- valued vector 
n G M^, whose coordinate entry rij- ... corresponds to the number of times the label com- 
bination was observed in the sample, for each {ii,...,ik) G D. The table n has a 
Multinomiald(A^, p) distribution, where p is a point in the {d — l)-dimensional probability simplex 
Arf_i with coordinates 

Pii,...,ik = Pr{{Xi, ...,Xk) = {ii,.. . , (ii,. . • ,ifc) G V. 

Let H be an unobservable latent variable, defined on the set [r] = {1, . . . , r}. In its most basic 
version, also known as the naive Bayes model, the LC model postulates that, conditional on H, the 
variables Xi, . . . ,Xk are mutually independent. Specifically, the joint distributions oi Xi, . . . ,Xk 
and H form the subset V of the probability simplex Adr-i consisting of points with coordinates 

Ph,...,ik,h=Pi'Hh) ■■■pl^\ik)>^h, {h,...,ik,h) eV x[r], (1) 

where Xh is the marginal probability Pr{H = h} and p[^\ii) is the conditional marginal probability 
Pr{Xi = ii\H = h}, which we assume to be strictly positive for each h G [r] and (zi, . . . , i^) G V. 

The log-linear model specified by the polynomial mapping (1) is a decomposable graphical 
model (see, e.g, Lauritzen, 1996) and V is the image set of a homeomorphism from the parameter 
space 

e = {e:9 = {p[''\ii)...pf\ik),Xh),{ii,...,ik,h)eVx[r]} 
= (g)^ Arf^_i X Ar„i, 

so that global identifiability is guaranteed. The remarkable statistical properties of this type of 
model and the geometric features of the set V are well understood. Statistically, equation (1) 
defines a linear exponential family of distributions, though not in its natural parametrization. The 
maximum likelihood estimates, or MLEs, of and (ii) exist if and only if the minimal sufficient 
statistics, i.e., the empirical joint distributions of {Xi,H) for i = 1,2,..., A;, are strictly positive 
and are given in closed form as rational functions of the observed two-way marginal distributions 
between Xi and H ior i = 1,2, . . . ,k. The log-likelihood function is strictly concave and the global 
maximum is always attainable, possibly on the boundary of the parameter space. Furthermore, 
the asymptotic theory of goodness-of-fit testing is fully developed. The statistical problem arises 
because H is latent and unobservable. 

Geometrically, we can obtain the set V as the intersection of Aj^^-i with an affine variety (see, 
e.g.. Cox et al., 1996) consisting of the solutions set of a system of r fli (2') homogeneous square- 
free polynomials. For example, when k = 2, each of these polynomials take the form of quadric 
equations of the type 

Pil,i2,hPi[,i'2,h = Pi[,i2,hPii,i'2,h^ (2) 

with ii / i[, 12 / i'2 and for each fixed h. Equations of the form (2) are nothing more than 
conditional odds ratio of 1 for every pair {Xi,Xii) given H = /land, for each given h, the coordinate 
projections of the first two coordinates of the points satisfying (2) trace the surface of independence 
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inside the simplex A^_i. The strictly positive points in V form a smooth manifold whose dimension 
is r Y\-{di — 1) + {r — 1) and whose co-dimension corresponds to the number of degrees of freedom. 
The singular points in V all lie on the boundary of the simplex A^r-i and identify distributions with 
degenerate probabilities along some coordinates. The singular locus of V can be described similarly 
in terms of stratified components of V, whose dimensions and co-dimensions can also be computed 
explicitly. 

Under the LC model, the variable H is unobservable and the new model ?^ is a r-class mixture 
over the exponential family of distributions prescribing mutual independence among the manifest 
variables Xi, . . . ,Xk. Geometrically, H is the set of probability vectors in Ad-i obtained as the 
image of the marginalization map from A^f^-i onto A(f_i which consists of taking the sum over the 
coordinate corresponding to the latent variable. Formally, H is made up of of all probability vectors 
in Ad-i with coordinates satisfying the accounting equations (see, e.g., Henry and Lazarfeld, 1968) 

Piu-,ik = Yl Pi^'-'ik,h = Yl Pf^^^^) ■ ■■Pk'\ik)Xh, (3) 
/ie[r] h€[r] 

where {ii, . . . ,ik,h) £T>x [r]. 

Despite being expressible as a convex combination of very well-behaved models, even the sim- 
plest form of the LC model (3) is far from well-behaved and, in fact, shares virtually none of the 
properties of the standard log-linear models (1) described above. In particular, latent class mod- 
els described by equations (3) do not define exponential families, but instead belong to a broader 
class of models called stratified exponential families (see Geiger et al., 2001), whose properties are 
much weaker and less well understood. The minimal sufficient statistics for an observed table n are 
the observed counts themselves and we can achieve no data reduction via sufficiency. The model may 
not be identifiable, because for a given p G A^-i defined by (3), there may be a subset of Q, known 
as the non-identifiable space, consisting of parameter points all satisfying the same accounting equa- 
tions. The non-identifiability issue has in turn considerable repercussions for the determination of 
the correct number of degrees of freedom for assessing model fit and, more importantly, on the 
asymptotic properties of standard model selection criteria (e.g. likelihood ratio statistic and other 
goodness-of-fit criteria such as BIG, AIG, etc), whose applicability and correctness may no longer 
hold. 

Computationally maximizing the log-likelihood can be a rather laborious and difficult task, par- 
ticularly for high dimensional tables, due to lack of concavity, the presence of local maxima and 
saddle points, and singularities in the observed Fisher information matrix. Geometrically, H is no 
longer a smooth manifold on the relative interior of A^.i, with singularities even at probability 
vectors with strictly positive coordinates, as we show in the next section. The problem of character- 
izing the singular locus of H and of computing the dimensions of its stratified components (and of 
the tangent spaces and tangent cones of its singular points) is of statistical importance: singularity 
points of H are probability distributions of lower complexity, in the sense that they are specified 
by lower-dimensional subsets of 0, or, loosely speaking, by less parameters. Because the sample 
space is discrete, although the singular locus of H has typically Lebesgue measure zero, there is 
nonetheless a positive probability that the maximum likelihood estimates end up being either a 
singular point in the relative interior of the simplex A^-i or a point on the boundary. In both cases, 
standard asymptotics for hypothesis testing and model selection fall short. 
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3 Geometric Description of Latent Class Models 



In this section, we give a geometric representation of latent class models, summarize existing results 
and point to some of the relevant mathematical literature. For more details, see Garcia et al. (2005) 
and Garcia (2004). 

The latent class model defined by (3) can be described as the set of all convex combinations of 
all r-tuple of points lying on the surface of independence inside Ad-i- Formally, let 

a: A'^i-i X . . . X A^*^-! Ad-i 

{pi{ii),...,Pkiik)) ^ UjPjiij) 

be the map that sends the vectors of marginal probabilities into the A;-dimensional array of joint 
probabilities for the model of complete independence. The set S = a{A'^'^^^ x . . . x A'^'-^^) is a 
manifold in Ad-i know^n in statistics as the surface of independence and in algebraic geometry (see, 
e.g. Harris, 1992) as (the intersection of A^-i with) the Segre embedding of P'^i^^ x . . . x P'^fc^i 
into F'^~^. The dimension of 5 is — 1), i.e., the dimension of the corresponding decomposable 
model of mutual independence. The set H can then be constructed geometrically as follows. Pick 
any combination of r points along the hyper-surface S, say p^^\ . . . , p^'''\ and determine their 
convex hull, i.e. the convex subset of Ad-i consisting of all points of the form J2hP^^^^h, for 
some choice of (Ai, . . . , Xr) G A^-i. The coordinates of any point in this new subset satisfy, by 
construction, the accounting equations (3). In fact, the closure of the union of all such convex 
hulls is precisely the latent class model H. In algebraic geometry, H would be described as the 
intersection of A^-i with the r-th secant variety of the Segre embedding mentioned above. 




Figure 1: Surface of independence for the 2x2 table with 3 secant lines. 

Example 3.1. The simplest example of a latent class model is for a 2 x 2 table with r = 2 latent 
classes. The surface of independence, i.e. the intersection of the simplex A3 with the Segre variety, 
is shown in Figure 1 . The secant variety for this latent class models is the union of all the secant 
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lines, i.e. the lines connecting any two distinct points lying on the surface of independence. Figure 
1 displays three such secant lines. It is not to hard to picture that the union of all such secant lines is 
the enveloping simplex A3 and, therefore, H fills up all the available space (for formal arguments, 
see Catalisano et al., 2002, Proposition 2.3). 

The model H, thought of as a portion of the r-th secant variety to the Segre embedding, is 
not a smooth manifold. Instead, it is a semi-algebraic set (see, e.g., Benedetti, 1990), clearly 
singular on the boundary of the simplex, but also at strictly positive points along the (r — l)st 
secant variety, (both of Lebesgue measure zero). This means that the model is singular at all points 
in H which satisfy the accounting equations with one or more of the X^'s equal to zero. In Example 
3.1 above, the surface of independence is a singular locus for the latent class model. From the 
statistical viewpoint, singular points of H correspond to simpler models for which the number of 
latent classes is less than r (possibly 0). As usual, for these points one needs to adjust the number 
of degrees of freedom to account for the larger tangent space. 

Unfortunately, we have no general closed-form expression for computing the dimension of H 
and the existing results only deal with specific cases. Simple considerations allow us to compute an 
upper bound for the dimension of H, as follows. As Example 3.1 shows, there may be instances for 
which H fills up the entire simplex Ad-i, so that d — 1 is an attainable upper bound. Counting the 
number of free parameters in (3), we can see that this dimension cannot exceed r — 1) + r — 1, 
c.f., (Goodman, 1974, page 219). This number, the standard dimension, is the dimension of the fully 
observable model of conditional independence. Incidentally, this value can be determined mirroring 
the geometric construction of ?^ as follows (c.f., Garcia (2004)). The number rj^ii^i — 1) arises 
from the choice of r points along the YliA'^i ~ 1) -dimensional surface of independence, while the 
term r — 1 accounts for the number of free parameters for a generic choice of (Ai, . . . , Ar) € A^-i. 
Therefore, we conclude that the dimension of TL is bounded by 



a value known in algebraic geometry as the expected dimension the variety H. 

Cases of latent class models with dimension strictly smaller than the expected dimension have 
been known for a long time, however. In the statistical literature, Goodman (1974) noticed that the 
latent class models for 4 binary observable variables and a 3-level latent variable, whose expected 
dimension is 14, has dimension 13. In algebraic geometry, secant varieties with dimension smaller 
than the expected dimension (4) are called deficient (e.g., see Harris, 1992). In particular. Exercise 
11.26 in Harris (1992) gives an example of deficient secant variety, which corresponds to a latent 
class model for a 2-way table with a latent variable taking on 2 values. In this case, the deficiency is 
2, as is demonstrated below in equation (5). The true or effective dimension of a latent class model, 
i.e. the dimension of the semi-algeraic set TL representing it, is crucial for establishing identifiability 
and for computing correctly the number of degrees of freedom. In fact, if a model is deficient, then 
that the pre-image of each probability array in Ti arising from the accounting equations is a subset 
(in fact, a variety) of called the non-dentifiable subspace, with dimension exactly equal to the 
deficiency itself. Therefore, a deficient model is non-identifiable, with adjusted degrees of freedom 
equal to number of degrees of freedom for the observable graphical model plus the value of the 
deficiency. 

Theoretically, it is possible to determine the effective dimension of H by computing the maxi- 
mal rank of the Jacobian matrix for the polynomial mapping from into H given coordinatewise 




(4) 
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by (3). In fact, Geiger et al. (2001) showed that this value is equal to the dimension of H almost 
everywhere with respect to the Lebsegue measure, provided the Jacobian is evaluated at strictly 
positive parameter points 6. These symbolic evaluations, however, require the use of symbolic soft- 
ware which can only handle small tables and models, so that, in practice, computing the effective 
dimension of a latent class model is computationally difficult and often unfeasible. 

Recently, in the algebraic- geometry literature, Catalisano et al. (2002, 2003) have obtained ex- 
plicit formulas for the effective dimensions of some secant varieties which are of statistical interest. 
In particular, they show that for /c = 3 and r < Ta\n{di,d2,d^}, the latent class model has the 
expected dimension and is identifiable. On the other hand, assuming di < d2 <■■■< dk, H is 

deficient when nti " Eti (^^i - 1) < r < min {4, Flti '^^ " l}- 

Finally, under the same 

conditions, Ti is identifiable when | YliA'^i — 1) + 1 > max{(ifc, r}. Obtaining bounds and results of 
this type is highly non-trivial and is an open area of research. 

In the remainder of the paper, we will focus on simpler latent class models for tables of dimen- 
sion k = 2 and illustrate with examples the results mentioned above. For latent class models on 
two-way tables, there is an alternative, quite convenient way of describing H by representing each 
p in A^-i as a di X d2 matrix and by interpreting the map cr as a vector product. In fact, each 
point p in 5 is a rank one matrix obtained as pipj, where pi G Ad^_i and p2 G Adi-2 are the 
appropriate marginal distributions of Xi and X2. Then, the accounting equations for a latent class 
models with r-level become 

P = YjPi'\p2'^V ^h, (pi,P2,(Ai,...,A^)) G Arf^_i X Arf2_i X A^_i 

h 

i.e. the matrix p is a convex combination of r rank 1 matrices lying on the surface of independence. 
Therefore all points in H are non-negative matrices with entries summing to one and with rank 
at most r. This simple observation allows one to compute the effective dimension of H for 2- 
way table as follows. In general, a real valued di x d2 matrix has rank r or less if and only if 
the homogeneous pol3momial equations corresponding to all of its (r + 1) x (r + 1) minors all 
vanish. Provided k < mm{di,d2}, on M*^! x M.'^'^, the zero locus of all such such equations form a 
determinantal variety of co-dimension {di — r){d2 — r) (Harris, 1992, Proposition 12.2) and hence 
has dimension r{di + d2) — r^- Subtracting this value from the expected dimension computed above, 
and taking into account the fact that all the points lie inside the simplex, we obtain 

r{di + da - 2) + r - 1 - {r{di + ds) - - l) = r(r - 1). (5) 

This number is also the difference between the dimension of the (fully identifiable, i.e. of expected 
dimension) graphical model of conditional independence Xi and X2 given H, and the deficient 
dimension of the latent class model obtained by marginalizing over the variable H. 

The study of higher dimensional tables is still an open area of research. The mathematical 
machinery required to handle larger dimensions is considerably more complicated and relies on 
the notions higher-dimensional tensors, rank tensors and non-negative rank tensors, for which 
only partial results exist. See Kruskal (1975), Cohen and Rothblum (1993) and Strassen (1983) 
for details. Alternatively, Mond et al. (2003) conduct an algebraic-topological investigation of 
the topological properties of stochastic factorization of stochastic matrices representing models of 
conditional independence with one hidden variable and Allman and Rhodes (2006, 2007) explore 
an overlapping set of problems framed in the context of trees with latent nodes and branches. 
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The specific case of k-way tables with 2 level latent variables is a fortunate exception, for which 
the results for 2-way tables just described apply. In fact, Landsberg and Manivel (2004) show that 
that these models are the same as the corresponding model for any two-dimensional table obtained 
by any "flattening" of the di x . . . x dfc-dimensional array of probabilities p into a two-dimensional 
matrix. Flattening simply means collapsing the k variables into two new variables with fi and /2 
levels, and re-organizing the entries of the /c-dimensional tensor p e A^i^i into a /i x /i matrix 
accordingly, where, necessarily, /i + /2 = di. Then, H is the determinantal variety which is the 
zero set of all 3 x 3 sub-determinants of the matrix obtained by any such flattening. The second 
example in Section 4.1 below illustrates this result. 

4 Examples Involving Synthetic Data 

We further elucidate the non-identifiability phenomenon from the algebraic and geometric point 
of view, and the multi-modality of the log-likelihood function issue using few, small synthetic ex- 
amples. In particular, in the "100 Swiss Frank" problem below, we embark on a exhaustive study 
of a table with symmetric data and describe the effects of such symmetries on both the parameter 
space and the log-likelihood function. Although this example involves one of the simplest cases of 
LC models, these tables already exhibit considerable statistical and geometric complexity. 

4. 1 Effective Dimension and Polynomials 

We show how it is possible to take advantage of the polynomial nature of equations (3) to gain 
further insights into the algebraic properties of distributions obeying latent class models. All the 
computations that follow were made in SINGULAR (Greuel et al., 2005) and are described in details, 
along with more examples, in Zhou (2007). Although in principle symbolic algebraic software 
allows one to compute the set of polynomial equations that fully characterize LC models and their 
properties, this is still a rather difficult and costly task that can be accomplished only for smaller 
models. 

The accounting equations (3) determine a polynomial mapping / : ^ A"^"^ given by 



so that the latent class model can be analytically defined as the image of this map, i.e. H = 
f{&). Then, following the geometry-algebra dictionary principle (see, e.g.. Cox et al., 1996), the 
problem of computing the effective dimension of ?i can in turn be geometrically cast as a problem of 
computing the dimension of the image of a polynomial map. We illustrate how this representation 
offers considerable advantages with some small examples. 

Consider a 2 x 2 x 2 table with r = 2 latent classes. From Proposition 2.3 in Catalisano et 
al. (2002), the latent class models with 2 classes and 3 manifest variables are identifiable. The 
standard dimension, i.e. the dimension of the parameter space is r Ylii'^i — 1) + — 1 = 7, which 
coincides with the dimension of the enveloping simplex Ay. Although this condition implies that the 
number of parameters to estimate is no larger than the number of cells in the table, a case which, 
if violated, would entail non-identifiability, it does not guarantee that the effective dimension is 
also 7. This can be verified by checking that the symbolic rank of the Jacobian matrix of the map 
(6) is indeed 7, almost everywhere with respect to the Lebesgue measure. Alternatively, one can 




(6) 
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determine the dimension of the non-identifiable subspace using computational symbolic algebra. 
First, we define the ideal of polynomials determined by the 8 equations in (6) in the polynomial 
ring in which the (redundant) 16 indeterminates are the 8 joint probabilities in A7 and the 3 
pairs of marginal probabilities in Ai for the observable variables, and the marginal probabilities in 
Ai for the latent variable. Then we use implicization (Cox et al., 1996, Chapter 3) to eliminate 
all the marginal probabilities and to study the Groebner basis of the resulting ideal in which the 
indeterminates are the joint probabilities only. There is only one element in the basis. 

Pill + P112 + P121 + P122 + P211 + P212 + P221 + P222 = 1, 

which gives the trivial condition for probability vectors. This implies the map (6) is surjective, 
so that H = Aj and the effective dimension is also 7, showing identifiability, at least for positive 
distributions. 

Next, we consider the 2x2x3 table with r = 2. For this model Q has dimension 9 and the image 
of the mappings (6) is Ag. The symbolic rank of the associated Jacobian matrix is 9 as well and 
the model is identifiable. The image of the polynomial mapping determined by (6) is the variety 
associated to the ideal whose Groebner basis consists of the trivial equaiton 

Pill + P112 + P113 + P121 + P122 + P123 + P211 + P212 + P213 + P221 + P222 + P223 = 1, 
and four polynomials corresponding to the determinants 



P121 


P2II 


P22I 


P122 


P212 


P222 


Pus 


P213 


P223 


Pl+l 


P2II 


P22I 


Pl+2 


P212 


P222 


Pl+3 


P213 


P223 




P12I 


P22I 


P+12 


P122 


P222 


P+13 


P123 


P223 



(7) 



Pill P121+P211 P221 

PII2 P122 + ?'212 P222 
Pll3 P123 + P213 P223 

where the subscript symbol "+" indicates summation over that coordinate. In turn, the zero set of 
the above determinants coincide with the determinantal variety specified by the zero set of all 3 x 3 
minors of the 3x4 matrix 

Pill P121 P211 P221 \ 

P112 P122 P212 P222 j (8) 
P113 P123 P213 P223 J 

which is a flattening of the 2x2x3 array of probabilities describing the joint distribution for the 
latent class model under study. This is in accordance with the result in Landsberg and Manivel 
(2004) of mentioned above. Now, the determinantal variety given by the vanishing locus of all the 
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3x3 minors of the matrix (8) is the latent class model for a 3 x 4 table with 2 latent classes, which, 
according to (5), has deficiency equal to 2. The effective dimension of this variety is 9, computed 
as the standard dimension, 11, minus the deficiency. Then,the effective dimension of the model we 
are interested is also 9 and we conclude that the model is identifiable. 

Table 1 summarizes some of our numerical evaluations of the different notions of dimension 
for a different LC models. We computed the effective dimensions by evaluating with MATLAB the 
numerical rank of the Jacobian matrix, based on the simple algorithm suggested in Geiger et al. 

(2001) and also using SINGULAR, for which only computations involving small models were feasible. 

Table 1: Different dimensions of some latent class models. The Complete Dimension is the di- 
mension d — 1 of the envoloping probability simplex Ad-i- See also Table 1 in Kocka and Zhang 

(2002) . 







Effective 


Standard 


Complete 




Latent Class Model 


Dimension 


Dimension 


Dimension 


Denciency 


Ad_i 
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2x2 


2 


3 


5 


3 





O o 

3x3 


z 


7 


9 


8 


1 


4x5 


3 


17 


23 


19 


2 


2x2x2 


2 


/ 


/ 


/ 


u 


2x2x2 


3 


7 


11 


7 





2x2x2 


4 


7 


15 


7 





3x3x3 


2 


13 


13 


26 





3x3x3 


3 


20 


20 


26 





3x3x3 


4 


25 


27 


26 


1 


3x3x3 


5 


26 


34 


26 





3x3x3 


6 


26 


41 


26 





5x2x2 


3 


17 


20 


19 


2 


4x2x2 


3 


14 


17 


15 


1 


3x3x2 


5 


17 


29 


17 





6x3x2 


5 


34 


44 


35 


1 


10 X 3 X 2 


5 


54 


64 


59 


5 


2x2x2x2 


2 


9 


9 


15 





2x2x2x2 


3 


13 


14 


15 


1 


2x2x2x2 


4 


15 


19 


15 





2x2x2x2 


5 


15 


24 


15 





2x2x2x2 


6 


15 


29 


15 






4.2 The 100 Swiss Franc Problem 
4.2.1 Introduction 

Now we study the problem of fitting a non-identifiable 2-level latent class model to a two-way table 
with symmetry counts. This problem was suggested by Bernd Sturmfels to the participants of his 
postgraduate lectures on Algebraic Statistics held at ETH Zurich in the Summer semester of 2005 
(where he offered 100 Swiss franks for a rigorous solution), and is described in detail as Example 
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1.16 in Pachter and Sturmfels (2005). The observed table is 



( ^ 


2 


2 


2 \ 


2 


4 


2 


2 


2 


2 


4 


2 


V 2 


2 


2 


4 / 



(9) 



For the basic latent class model, the standard dimension of = A3 x A3 x Ai is 2(3 + 3) + 1 = 13 
and, by (5), the deficiency is 2. Thus, the model is not identifiable and the pre-image of each point 
p G W by the map (6) is a 2-dimensional surface in 0. To keep the notation light, we write aih 
for p"i\i) and fijh for p'^\j), where i,j = 1, . . . , 4 and a^'^^ and /J^^^ for the conditional marginal 
distribution of Xi and X2 given H = h, respectively. The accounting equations for the points in H 
become 

Pij = X] ^haihPjh, hj e [4] (10) 
/ie{i,2} 

and the log-likelihood function, ignoring an irrelevant additive constant, is 

^(9) = uij log XhaihPjh , 6* G A3 X A3 X Ai. 

ij \he{l,2} J 

It is worth emphasizing, as we did above and as the previous display clearly shows, that the ob- 
served counts are minimal sufficient statistics. 

Alternatively, we can re-parametrize the log-likelihood function using directly points in H rather 
the points in the parameter space Q. Recall from our discussion in section 3 that, for this model, the 
4x4 array p is in if and only if each 3x3 minor vanishes. Then, we can write the log-likelihood 
function as 

^(P) = J^nijlogPij, P e Ai5, det(p*j) = all i,j G [4], (11) 

where p*^- is the 3x3 sub-matrix of p obtained by erasing the ith row and the jth column. 

Although the first order optimality conditions for the Lagrangian corresponding to the parametriza- 
tion (11) are algebraically simpler and can be given the form of a system of a polynomial equations, 
in practice, the classical parametrization (10) is used in both the EM and the Newton-Raphson im- 
plementations in order to compute the maximum likelihood estimate of p. See Goodman (1979), 
Haberman (1988), and Redner and Walker (1984) for more details about these numerical proce- 
dures. 

4.2.2 Global and Local Maxima 

Using both EM and Newton-Raphson algorithm with several different starting points, we found 7 
local maxima of the log-likelihood function, reported in Table 2. The global maximum was found 
experimentally to be — 20. 8074 + consi., where const, denotes the additive constant stemming from 
the multinomial coefficient. The maximum is achieved by the three tables of fitted values Table 
2 a). The remaining four tables are local maximum of —20.8616 + const., close in value to the 
actual global maximum. Using SINGULAR (see (Greuel et al., 2005)), we checked that the tables 
found satisfy the first order optimality conditions (11). After verifying numerically the second 
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Table 2: Tables of fitted value corresponding to 7 the maxima of the likelihood equation for the 
observed table (9). a): global maximua (log-likelihood value —20.8079). b): local maxima (log- 
likelihood value -20.8616). 



/3 
3 
2 
V 2 



2\ 
2 

3 
3 / 





a) 




1 ^ 


2 3 


M 


2 


3 2 


3 


3 


2 3 


2 


V 2 


3 2 


3 / 




b) 





/3 
2 
2 
V 3 



2 
2 

3 / 



/ 8/3 8/3 8/3 2 \ 

8/3 8/3 8/3 2 

8/3 8/3 8/3 2 

V 2 2 2 4 / 
/ 8/3 2 8/3 8/3 \ 

2 4 2 2 

8/3 2 8/3 8/3 

V 8/3 2 8/3 8/3 / 



/ 8/3 8/3 2 8/3 \ 

8/3 8/3 2 8/3 

2 2 4 2 

V 8/3 8/3 2 8/3 / 
/ 4 2 2 2 \ 

2 8/3 8/3 8/3 

2 8/3 8/3 8/3 

V 2 8/3 8/3 8/3 / 



order optimality conditions, we conclude that those points are indeed local maxima. Furthermore, 
as indicated in Pachter and Sturmfels (2005), the log-likelihood function also has a few^ saddle 
points. 

A striking feature of the global maxima in Table 2 is their invariance under the action of the 
symmetric group on four elements acting simultaneously on the row and columns. Different sym- 
metries arise for the local maxima. We will give an explicit representation of these symmetries 
under the classical parametrization (10) in the next section. 

Despite the simplicity and low-dimensionality of the LC model for the Swiss franc problem and 
the strong symmetric features of the data, we have yet to provide a purely mathematical proof that 
the three top arrays in Table 2 correspond to a global maximum of the likelihood function. We view 
the difficulty and complexity of the 100 Swiss Francs problem as a consequence of the inherent 
difficulty of even small LC models and perhaps an indication that the current theory has still many 
open, unanswered problems. In Section 6, we present partial results towards the completion of the 
proof. 

4.2.3 Unidentifiable Space 

It follows from equation (5) that the non-identifiable subspace is a two-dimensional subset of 0. We 
give an explicit algebraic description of this space, which we will then use to obtain interpretable 
plots of the profile likelihood. 

Firstly, we focus on the three global maxima in Table 2 a). By the well-known properties of the 
EM algorithm (see, e.g., Pachter and Sturmfels, 2005, Theorem 1.15), if the vector of parameters 
is a stationary point in the maximization step of the EM algorithm, then is a critical point and 
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Figure 2: The 2-dimensional surface defined by equation (13), when evaluated over the ball in 
of radius 3, centered at the origin. The inner box is the unit cube [0, 1]'^. 

hence a good candidate for a local maximum. Using this observation, it is possible to show (see 
Zhou, 2007) that any point in satisfying the equations 



aih = "2h, "3/1 = "4/1 h = l,2 
Pih = P2h, Psh = Pah /i = 1, 2 
E/i >^haihPih = E/i >^ha3hP3t = 3/40 
Y^h >^haihP3h = Eh Ka3hPit = 2/40 



(12) 



is a stationary point. Notice that the first four equations in (12) require a'^^^ and P^'^^ to each 
have the first and second pairs of coordinates identical, for h = 1,2. The equation (12) defines a 
2-dimensional surface in 0. Using SINGULAR, we can verify that, holding, for example, an and Pu 
fixed, determines all of the other parameters according to the equations 

/- \, 1 

'^1 ~ 80aii;aii-20aii-20*/3ii+6 

As = 1 - Ai 

a2i = "11 

"31 = "41 = 
"12 = "22 = 
0;32 = "42 
P2I = Pll 
P3I = Pil = 
P12 = P22 = 
P32 = P42 = 



: 0.5 — ail 

. 10/3ii-3 
" 10(4/3ii~l) 

= 0.5 - Q12 
0.5 - Pu 

lOaii-3 
10(4aii-l) 

0.5 -Pu. 



Using elimination (see Cox et al., 1996, Chapter 3) to remove all the variables in the system except 
for Al, we are left with one equation 



80Aiaii/3ii - 20Aiaii - 20Ai/3ii + 6A1 - 1 = 0. 



(13) 



14 



Without the constraints for the coordinates of au, Pii and Ai to be probabilities, (13) defines a two- 
dimensional object in M^, depicted in Figure 2. Notice that the axes do not intersect this surface, 
so that zero is not a possible value for au, Pu and Ai. Because the non-identifiable space in Q is 
2-dimensional, equation (13) actually defines a bijection between an, Pn and Ai and the rest of 
the parameters. Then, the intersection of the surface (13) with the unit cube [0, 1]'^, depicted as a 
red box in Figure 2, is the projection of the whole non-identifiable subspace into the 3-dimensional 
unit cube where an, Pn and Ai live. Figure 3 displays two different views of this projection. 

The preceding arguments hold unchanged if we replace the symmetry conditions in the first 
two lines of equation (12) with either of these other two conditions, requiring different pairs of 
coordinates to be identical, namely 

aih = a^h, = "4/1, Pih = Psh, P2h = Pih (14) 

and 

Qlh = a4/i, a2h = Oizh, Plh = Pih, P2h = P3h, (15) 

where h = 1,2. 

By our computations, the non-identifiable surfaces inside Q corresponding each to one of the 
three pairs of coordinates held fixed in equations (12), (14) and (15), produce the three distinct 
tables of maximum likelihood estimates reported in Table 2 a). Figure 3 shows the projection of 
the non-identifiable subspaces for the three MLEs in Table 2 a) into the three dimensional unit cube 
for Ai, au and Pu- Although each of these three subspaces are disjoint subsets of Q, their lower 
dimensional projections comes out as unique. By projecting onto the different coordinates Ai, au 
and P21 instead, we obtain two disjoint surfaces for the first, and second and third MLE, shown in 
Figure 4. 

Table 3 presents some estimated parameters using the EM algorithm. Though these estimates 
are hardly meaningful, because of the non-identifiability issue, they show the symmetry properties 
we pointed out above and implicit in equations (12), (14) and (15), and they explain the invariance 
under simultaneous permutation of the fitted tables. In fact, the number of global maxima is the 
number of different configurations of the 4 dimensional vectors of estimated marginal probabilities 
with two identical coordinates, namely 3. This phenomenon, entirely due to the strong symmetry 
in the observed table (9), is completely separate from the non-identrifiability issues, but just as 
problematic. 

By the same token, we can show that vectors of marginal probabilities with 3 identical coor- 
dinates also produce stationary points for the EM algorithms. This type of stationary points trace 
surfaces inside Q which determine the local maxima of Table 2 b). The number of these local max- 
ima corresponds, in fact, to the number of possible configurations of 4-dimensional vectors with 
3 identical coordinates, namely 4. Figure 5 depicts the lower dimensional projections into Ai, au 
and Pu of the non-identifiable subspaces for the first MLE in Table 2 a), the first three local maxima 
and the last local maxima in Table 2 b) . 

We can summarize our finding as follows: the maxima in Table 2 define disjoint 2-dimensional 
surfaces inside the parameter space Q, the projection of one of them depicted in Figure 3. While 
non-identifiability is a structural feature of these models which is independent of the observed data, 
the multiplicity and invariance properties of the maximum likelihood estimates and the other local 
maxima is a phenomenon cause by the symmetry in the observed table of counts. 
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Table 3: Estimated parameters by the EM algorithm for the three global maxima in Table 2 a). 



Estimated Means 



Estimated Parameters 



/ 3 3 2 2 \ 

3 3 2 2 

2 2 3 3 

V 2 2 3 3 / 



/ 3 
2 

3 

V 2 

/ 3 
2 
2 

V 3 



2\ 

3 

2 
3 / 

3\ 

2 

2 
3 / 



a 



(1) 



a 



a 



1) 



(1) = ^(1) 



(1) = ^(1) 



/ 0.3474 \ 
0.3474 
0.1526 

V 0.1526 / 

/ 0.3474 \ 
0.1526 
0.3474 

V 0.1526 J 

( 0.3474 \ 
0.1526 
0.1526 

V 0.3474 / 



a 



(2) 



2) 



a 



(2) = ^(2) 



a 



(2) = ^(2) 



/ 0.1217 \ 
0.1217 
0.3783 
\ 0.3783 ) 

( 0.1217 \ 
0.3783 
0.1217 

V 0.3783 / 

/ 0.1217 \ 
0.3783 
0.3783 

V 0.1217 / 



0.5683 
0.4317 



0.5683 
0.4317 



0.5683 
0.4317 



4.2.4 Plotting the Log-likelihood Function 

Having determined that the non-identifiable space is 2-dimensional and that there are multiple 
maxima, we proceed with some plots of the profile log-likelihood function. To obtain a non-trivial 
surface, we need to consider three parameters. Figures 9 and 7 display the surface and contour 
plot of the profile log-likelihhod function for an and 021 when is one of the fixed parameters. 
Both Figures show clearly the different maxima of the log-likelihood function, each lying on the 
top of "ridges" of the log-likelihood surface which are placed symmetrically with respect to each 
others. The position and shapes of these ridges reflect, once again, the invariance properties of the 
estimated probabilities and parameters. 

4.2.5 Further Remarks and Open Problem 

We conclude this section with some observations and pointers to open problems. 

One of the interesting aspects we came across while fitting the table (9) was the proximity of the 
values of the local and global maxima of the log-likelihood function. Furthermore, although these 
values are very close, the fitted tables corresponding to global and local maxima are remarkably 
different. Even though the data (9) are not sparse, we wonder about the effect of cell sizes. Figure 
8 show the same profile log-likelihood for the table (9) multiplied by 10000. While the number 
of global and local maxima, the contour plot and the basic symmetric shape of the profile log- 
likelihood surface remain unchanged after this rescaling, the peaks around the global maxima have 
become much more pronounced and so has the difference between of the values of the global and 
local maxima. 

We have studied at a number of variations of table (9), focussing in particular on the symmetric 
data. We report only some of our results and refer to Zhou (2007) for a more extensive study. Table 
4 shows the values and number of local and global maxima for a the 6x6 version of (9). As for the 
4x4 case, we notice strong invariance features of the various maxima of the likelihood function 
and a very small difference between the value of the global and local maxima. 
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Fitting the same model to the table 

/ 1 2 2 2 \ 
2 12 2 
2 2 12 
\ 2 2 2 1 / 

we found 6 global maxima of the likelihood function, which give as many maximum likelihood 
estimates, all obtainable via simultaneous permutation of rows and columns of the table 



/ 7/4 7/4 7/4 7/4 \ 

7/4 7/4 7/4 7/4 

7/4 7/4 7/6 7/3 

V 7/4 7/4 7/3 7/6 / 



log-likelihood = -77.2927 + const. 



Based on the various cases we have investigated, we have the following conjecture, which we 
verified computationally up to dimension k = 50: 



Conjecture: The MLEs For the n x n table with values x along the diagonal and values y < x 
for off the diagonal elements, the maximum likelihood estimates for the latent class model with 

A B 



2 latent classes are the 2x2 block diagonal matrix of the form 
versions of it, where A, B, and C are 



B' C 



and the permutated 



A : 

B 
C 



y + 



Q 



and p= |_|J , g = n — p. 

We also noticed other interesting phenomena, which suggest the need for further geometric 
analysis. For example, consider fitting the (non-identifiable) latent class model with 2 classes to 
the table of counts (suggested by Bernd Sturmfels) 




Based on our computations, the maximum likelihood estimates appear to be unique, namely the 
table of fitted values 

■ 5 1 1 \ 

14 4. (16) 
14 4/ 

Looking at the non-identifiable subspace for this model, we found that the MLEs (16) can arise 
from combinations of parameters some of which can be 0, such as 



a 



0.7143 
0.1429 
0.1429 



a 



(2) = p{2) 




0.3920 
0.6080 
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This finding seems to indicate the possibility of singularities besides the obvious ones given by 
marginal probabilities for H containing coordinates (which have the geometric interpretation as 
lower order secant varieties) and by points p along the boundary of the simplex Arf_i. 
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Figure 3: Intersection of the surface defined by equation (13) with the unit cube [0, 1]^, different 
views obtained using surf in a) and MATLAB in b). 
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Figure 4: Projection of the non-identifiable spaces corresponding to the first and second and third 
MLE from Table 2 a) into the 3-dimensional unit cube where Ai, au and P21 take values. 




Figure 5: Projection of the non-identifiable spaces the first MLE in Table 2 a), the first three local 
maxima and the last local maxima in Table 2 b) into the 3-dimensional unit cube where Ai, an and 
Pii take values. In this coordinate system, the projection of non-identifiable subspaces for the first 
three local maxima in Table 2 b) results in the same surface; in order to obtain distinct surfaces, it 
would be necessary to change the coordinates over which the projections are made. 
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maximum log-likelihood when a is fixed to 0.2 








Figure 6: The plot of the profile likelihood as a function of an and 021 when 031 is fixed to 0.2. 
There are seven peaks: the three black points are the MLEs and the four gray diamonds are the 
other local maxima. 
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maximum log-likelihood when a is fixed to 0.2 



0.9 - 



0.8 - 




0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 

Figure 7: The contour plot of the profile likelihood as a function of an and a2i when a^i is fixed. 
There are seven peaks: the three black points are the MLEs and the four gray points are the other 
local maxima. 



x10 

-1.1005 
-1.101 ^ 



8 -1.1015 

1 -1.102 

- -1.1025 

-1.103 
1 




Figure 8: The contour plot of the profile likelihood as a function of an and a2i when 031 is fixed 
for the data (9) multiplied by 10000. As before, there are seven peaks: three global maxima and 
four identical local maxima. 
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Table 4: Stationary points for the 6x6 version of the table (9). All the maxima are invariant under 
simultaneous permutations of the rows and columns of the corresponding fitted tables. 



Fitted counts 



( A 2 2 2 2 2\ 

2 12/5 12/5 12/5 12/5 12/5 

2 12/5 12/5 12/5 12/5 12/5 

2 12/5 12/5 12/5 12/5 12/5 

2 12/5 12/5 12/5 12/5 12/5 

V 2 12/5 12/5 12/5 12/5 12/5 / 

/ 7/3 7/3 7/3 7/3 7/3 7/3 \ 

7/3 13/5 13/5 13/5 29/15 29/15 

7/3 13/5 13/5 13/5 29/15 29/15 

7/3 13/5 13/5 13/5 29/15 29/15 

7/3 29/15 29/15 29/15 44/15 44/15 

V 7/3 29/15 29/15 29/15 44/15 44/15 / 



/332 2 2 2\ 
3 3 2 2 2 2 

2 2 5/2 5/2 5/2 5/2 

2 2 5/2 5/2 5/2 5/2 

2 2 5/2 5/2 5/2 5/2 

V 2 2 5/2 5/2 5/2 5/2 / 



/ 8/3 


8/3 


8/3 


2 


2 


2 \ 


8/3 


8/3 


8/3 


2 


2 


2 


8/3 


8/3 


8/3 


2 


2 


2 


2 


2 


2 


8/3 


8/3 


8/3 


2 


2 


2 


8/3 


8/3 


8/3 


V 2 


2 


2 


8/3 


8/3 


8/3 / 


/ 7/3 


7/3 


7/3 


7/3 


7/3 


7/3 \ 


7/3 


7/3 


7/3 


7/3 


7/3 


7/3 


7/3 


7/3 


7/3 


7/3 


7/3 


7/3 


7/3 


7/3 


7/3 


7/3 


7/3 


7/3 


7/3 


7/3 


7/3 


7/3 


7/3 


7/3 


V 7/3 


7/3 


7/3 


7/3 


7/3 


7/3 / 



/ 7/3 


7/3 


7/3 


7/3 


7/3 


7/3 \ 


7/3 


35/9 


35/18 


35/18 


35/18 


35/18 


7/3 


35/18 


175/72 


175/72 


175/72 


175/72 


7/3 


35/18 


175/72 


175/72 


175/72 


175/72 


7/3 


35/18 


175/72 


175/72 


175/72 


175/72 


V 7/3 


35/18 


175/72 


175/72 


175/72 


175/72 / 



Log-likelihood 



-300.2524 + const. 



-300.1856 + const. 



-300.1729 + const. 



-300.1555 + const. (MLE) 



-301.0156 + const. 



-300.2554 + const. 
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5 Two Applications 



5.1 Example: Michigan Influenza 

Monto et al. (1985) present data for 263 individuals on the outbreak of influenza in Tecumseh, 
Michigan for during the four winters of 1977-1981: (1) Influenza type A (H3N2), December 1977- 
March 1978; (2) Influenza type A (HlNl), January 1979-March 1979; (3) Influenza type B, Jan- 
uary 1980-April 1980 and (4) Influenza type A (H3N2), December 1980-March 1981. The data 
have been analyzed by others including Haber (1986) and we reproduce them here as Table 5. The 
table is characterized by a large count for to the cell corresponding to lack of infection from any 
t3^e of influenza. 

Table 5: Infection profiles and frequency of infection for four influenza outbreaks for a sample of 
263 individuals in Tecumseh, Michigan during the winters of 1977-1981. A value of of in the first 
four columns indicates Source: Monto et al. (1985). The last column is the values fitted by the 
naive Bayes model with r = 2. 



Type of Influenza 


observed Counts 


Fitted Values 


(1) 


(2) 


(3) 


(4) 


















140 


139.5135 











1 


31 


31.3213 








1 





16 


16.6316 








1 


1 


3 


2.7168 





1 








17 


17.1582 





1 





1 


2 


2.1122 





1 


1 





5 


5.1172 





1 


1 


1 


1 


0.4292 


1 











20 


20.8160 


1 








1 


2 


1.6975 


1 





1 





9 


7.7354 


1 





1 


1 





0.5679 


1 


1 








12 


11.5472 


1 


1 





1 


1 


0.8341 


1 


1 


1 





4 


4.4809 


1 


1 


1 


1 





0.3209 



The LC model with one binary latent variable (identifiable by Theorem 3.5 in Settimi and Smith, 
2005) fits the data extremely well, as shown in Table 5. We also conducted a log-linear model anal- 
ysis of this dataset and concluded that there is no indication of second or higher order interaction 
among the four types of influenza. The best log-linear model selected via both Pearson's chi-squared 
and the likelihood ratio statistics was the model of conditional independence of influenza of type 
(2), (3) and (4) given influenza of type (1) and was outperformed by the LC model. 

Despite the reduced dimensionality of this problem and the large sample size, we report on the 
instability of the Fisher scoring algorithm implemented in the R package glim, e.g., see Espeland 
(1986). As the algorithm cycles through, the evaluations of Fisher information matrix become 
increasing ill-conditioned and eventually produce instabilities in the estimated coefficients and in 
the standard errors. These problems disappear in the modified Newton-Raphson implementation. 
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originally suggested by Haberman (1988), based on an inexact line search method known in the 
convex optimization literature as the Wolfe conditions. 

5.2 Data From the National Long Term Care Survey 

Erosheva (2002) and Erosheva et al. (2007) analyze an extract from the National Long Term Care 
Survey in the form of a 2^^ contingency table that contains data on 6 activities of daily living (ADL) 
and 10 instrumental activities of daily living (lADL) for community-dwelling elderly from 1982, 
1984, 1989, and 1994 survey waves. The 6 ADL items include basic activities of hygiene and 
personal care (eating, getting in/out of bed, getting around inside, dressing, bathing, and getting 
to the bathroom or using toilet). The 10 L\DL items include basic activities necessary to reside in 
the community (doing heavy housework, doing light housework, doing laundry, cooking, grocery 
shopping, getting about outside, travelling, managing money, taking medicine, and telephoning). 
Of the 65,536 cells in the table, 62384 (95.19%) contain zero counts, 1729 (2. 64%) contain counts 
of 1, 499 (0.76%) contain counts of 2. The largest cell count, corresponding to the (1,1,..., 1) cell, 
is 3853. 

Table 6: BIG and log-likelihood values for various values of r for the NLTCS dataset. 



r 


Dimension 


Maximal log-likelihood 


BIG 


2 


33 


-152527.32796 


305383.97098 


3 


50 


-141277.14700 


283053.25621 


4 


67 


-137464.19759 


275597.00455 


5 


84 


-135272.97928 


271384.21508 


6 


101 


-133643.77822 


268295.46011 


7 


118 


-132659.70775 


266496.96630 


8 


135 


-131767.71900 


264882.63595 


9 


152 


-131367.70355 


264252.25220 


10 


169 


-131033.79967 


263754.09160 


11 


186 


-130835.55275 


263527.24492 


12 


203 


-130546.33679 


263118.46015 


13 


220 


-130406.83312 


263009.09996 


14 


237 


-130173.98208 


262713.04502 


15 


254 


-129953.32247 


262441.37296 


16 


271 


-129858.83550 


262422.04617 


17 


288 


-129721.02032 


262316.06296 


18 


305 


-129563.98159 


262171.63265 


19 


322 


-129475.87848 


262165.07359 


20 


339 


-129413.69215 


262210.34807 



Erosheva (2002) and Erosheva et al. (2007) use an individual-level latent mixture model that 
bears a striking resemblance to the LC model. Here we report on analyses with the latter 

We use both the EM and Newton-Raphson algorithms to fit a number of LC models with up to 
20 classes, which can be shown to be all identifiable in virtue of Proposition 2.3 in Catalisano et 
al. (2002). Table 6 reports the maximal value of log-likelihood function and the value of BIG (the 
Bayesian Information Criterion), which seem to indicate that larger LC models with many levels 
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Table 7: Fitted values for the largest six cells for the NLTCS dataset for various r. 



r 


Fitted values 


2 


826.78 


872.07 


6.7 


506.61 


534.36 


237.41 


3 


2760.93 


1395.32 


152.85 


691.59 


358.95 


363.18 


4 


2839.46 


1426.07 


145.13 


688.54 


350.58 


383.19 


5 


3303.09 


1436.95 


341.67 


422.24 


240.66 


337.63 


6 


3585.98 


1294.25 


327.67 


425.37 


221.55 


324.71 


7 


3659.80 


1258.53 


498.76 


404.57 


224.22 


299.52 


8 


3663.02 


1226.81 


497.59 


411.82 


227.92 


291.99 


9 


3671.29 


1221.61 


526.63 


395.08 


236.95 


294.54 


10 


3665.49 


1233.16 


544.95 


390.92 


237.69 


297.72 


11 


3659.20 


1242.27 


542.72 


393.12 


244.37 


299.26 


12 


3764.62 


1161.53 


615.99 


384.81 


235.32 


260.04 


13 


3801.73 


1116.40 


564.11 


374.97 


261.83 


240.64 


14 


3796.38 


1163.62 


590.33 


387.73 


219.89 


220.34 


15 


3831.09 


1135.39 


660.46 


361.30 


261.92 


210.31 


16 


3813.80 


1145.54 


589.27 


370.48 


245.92 


219.06 


17 


3816.45 


1145.45 


626.85 


372.89 


236.16 


213.25 


18 


3799.62 


1164.10 


641.02 


387.98 


219.65 


221.77 


19 


3822.68 


1138.24 


655.40 


365.49 


246.28 


213.44 


20 


3836.01 


1111.51 


646.39 


360.52 


285.27 


220.47 


Observed 


3853 


1107 


660 


351 


303 


216 



are to be preferred. To provide a better sense of how well these LC models fit the data, we show in 
Table 7 the fitted values for the six largest cells, which, as mentioned, deviates considerably from 
most of the cell entries. We have also considered alternative model selection criteria such as AIC 
and modifications of it. AIC (with and without a 2nd order correction) points to /c > 20! (An ad- 
hoc modification of AIC due to Anderson et al. (1994) for overdispersed data gives rather bizarre 
results.) The dimensionality of a suitable LC model for these data appears to be much greater than 
for the individual level mixture model in Erosheva et al. (2007). 

Because of its high dimensionality and remarkable degree of sparsity this example offers an 
ideal setting in which to test the relative strengths and disadvantages of the EM and Newton- 
Raphson algorithm. In general, the EM algorithm, as a hill-climbing method, moves steadily to- 
wards solutions with higher value of the log-likelihood, but converges only linearly. On the other 
hand, despite its faster quadratic rate of convergence, the Newton-Raphson method tends to be 
very time and space consuming when the number of variables is large, and may be numerically 
unstable if the Hessian matrices are poorly conditioned around critical points, which again occurs 
more frequently in large problems (but also in small ones, such as the Michigan Influenza examples 
above) . 

For the class of basic LC models considered in this paper, the time complexity for one single 
step of the EM algorithm is O (d • r • ^ ■ dj), while the space complexity is O (d • r). In contrast, for 
the Newton-Raphson algorithm, both the time and space complexity are O (d- ■ dj). Conse- 
quently, for the NLTCS dataset, when r is bigger than 4, Newton-Raphson is sensibly slower than 
EM, and when r goes up to 7, Newton-Raphson needs more than IG of memory. Another significant 



26 



drawback of the Newton-Raphson method we experienced while fitting both the Michigan influenza 
and the NLTCS datasets is its potential numerical instability, due to the large condition numbers 
of the Hessian matrices. As remarked at the end of the previous section, following Haberman 
(1988), a numerically convenient solution is to modify the Hessian matrices so that they remain 
negative definite and then approximate locally the log-likelihood by a quadratic function. However, 
since the log-likelihood is neither concave and nor quadratic, these modifications do not necessarily 
guarantee an increase of the log-likelihood at each iteration step. As a result, the algorithm may 
experience a considerable slowdown in the rate of convergence, which we in fact observed with 
the NLTCS data. Table 8 shows the condition numbers for the true Hessian matrices evaluated at 
the numerical maxima, for various values of r. This table suggests that, despite full identifiability, 
the log-likelihood has a very low curvature around the maxima and that the log-likelihood may, in 
fact, look quite flat. 

To elucidate this point and some of the many difficulties in fitting LC models, we show in Figure 
9 the profile likelihood plot for the parameter ai2 in simplest LC model with r = 2. The actual 
profile log-likelihood is shown in red and is obtained as the upper envelop of two distinct, smooth 
curves, each corresponding to a local maxima of the log-likelihood. The location of the optimal 
value of ai2 is displayed with a vertical line. Besides illustrating multimodality the log-likelihood 
function in this example is notable for its relative flatness around its global maximum. 

Table 8: Condition numbers of Hessian matrices at the maxima for the NLTCS data. 



r 


Condition number 


2 


2.1843e + 03 


3 


1.9758e + 04 


4 


2.1269e + 04 


5 


4.1266e + 04 


6 


1.1720e + 08 


7 


2.1870e + 08 


8 


4.2237e + 08 


9 


8.7595e + 08 


10 


8.5536e + 07 


11 


1.2347e + 19 


12 


3.9824e + 08 


13 


1.0605e + 20 


14 


3.4026e + 18 


15 


3.9783e + 20 


16 


3.2873e + 09 


17 


1.0390e + 19 


18 


2.1018e + 09 


19 


2.0082e + 09 


20 


2.5133e + 16 
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Figure 9: The plot of the profile likelihood for the NLCST dataset, as a function of ai2- The vertical 
line indicates the location of the maximizen 

6 On Symmetric Tables and the MLE 

In this section, we show how S3mimetry in data allows one to symmetrize via averaging local max- 
ima of the likelihood function and to obtain critical points that are more symmetric. In various 
examples we looked at, these have larger likelihood than the tables from which they are obtained. 
We also prove that if the aforementioned averaging process always causes likelihood to go up, then 
among the 4x4 matrices of rank 2, the ones maximizing the log-likelihhod function for the 100 
Swiss Francs problem (9) are given in Table 2 a). We will further simplify the notation and will 
write L for the matrix of observed counts and M for the matrix of MLEs. 

6. 1 Introduction and Motivation 

A main theme in this section is to understand in what ways symmetry in data forces symmetry in 
the global maxima of the likelihood function. One question is whether our ideas can be extended at 
all to nonsymmetric data by suitable scaling. We prove that nonsymmetric local maxima will imply 
the existence of more symmetric points which are critical points at least within a key subspace and 
are related in a very explicit way to the nonsymmetric ones. Thus, if the EM algorithm leads to 
a local maximum which lacks certain symmetries, then one may deduce that certain other, more 
symmetric points are also critical points (at least within certain subspaces), and so check these to 
see if they give larger likelihood. There is numerical evidence that they do, and also a close look at 
our proofs shows that for "many" data points this symmetrization process is guaranteed to increase 
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maximum likelihood, by virtue of a certain single-variable polynomial encoding of the likelihood 
function often being real-rooted. 

Here is an example of our symmetrization process. Given the data 

4 2 2 2 2 2 

2 4 2 2 2 2 

2 2 4 2 2 2 

2 2 2 4 2 2' 

2 2 2 2 4 2 

2 2 2 2 2 4 

one of the critical points located by the EM algorithm is 



7/3 


7/3 


7/3 


7/3 


7/3 


7/3 


7/3 


13/5 


13/5 


13/5 


29/15 


29/15 


7/3 


13/5 


13/5 


13/5 


29/15 


29/15 


7/3 


13/5 


13/5 


13/5 


29/15 


29/15 


7/3 


29/15 


29/15 


29/15 


44/15 


44/15 


7/3 


29/15 


29/15 


29/15 


44/15 


44/15 



One way to interpret this matrix is that Mij = 7/3 + eifj where 

e = f = (0, 2/\/l5, 2/VlE, 2/Vl5, -3/Vl5, -3/\/l5). 
Our symmetrization process suggests replacing the vectors e and f each by the vector 

(l/\/l5,l/yi5,2/\/l5,2/^,-3/\/l5, -3/Vl5) 

in which two coordinates are averaged; however, since one of the values being averaged is zero, 
it is not so clear whether this should increase likelihood. However, repeatedly applying such sym- 
metrization steps to this example, does converge to a local maximum. Now let us speak more 
generally. Let M be an n by n matrix of rank at most two which has row and column sums all 
equalling kn, implying (by results of Section 6.2) that we may write Mij as k + Cifj where e, / are 
each vectors whose coordinates sum to 0. 

We are interested in the following general question: 

Question 6.1. Suppose a data matrix is fixed under simultaneously swapping rows and columns i, j. 
Consider any M as above, i.e. with Mij = k + Cifj. products also satisfied. Does e, > ej > 0, fi > 
fj > (or similarly Ci < ej < 0, fi < fj < ) imply that replacing Cj, ej each by ^^^-^ and fi, fj each 

f--\-f- 

fej 2 odvifays increases the likelihood? 

Remarks The weaker conditions Cj > ej = and fi > fj = (resp. ej < ej = 0, fi < fj = 0) 
do not always imply that this replacement will increase likelihood. However, one may consider the 
finite list of possibilities for how many zeroes the vectors e and f may each have; an affirmative 
answer to Question 6.1 would give a way to find the matrix maximizing likelihood in each case, 
and then we could compare this finite list of maxima to find the global maximum. 
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Question 6.2. Are all real-valued critical points of the likelihood function obtained by setting some 
number of coordinates in the e and f vectors to zero and then averaging by the above process so that the 
eventual vectors e and f have all positive coordinates equal to each other and all negative coordinates 
equal to each other? This seems to be true in many examples. 

One may check that the example discussed in Chapter 1 of Pachter and Sturmfels (2005) gives 
another instance where this averaging approach leads quickly to what appears to be a global max- 
imum. Namely, given the data matrix 

4 2 2 2 
2 4 2 2 
2 2 4 2 
2 2 2 4 

and a particular starting point, the EM algorithm converges to the saddle point 

4 2 3 3 
^ 2 4 3 3 
48 3 3 3 3 ' 

3 3 3 3 

whose entries may be written as Mij = 1/48(3 + Oibj) for a = (—1, 1, 0, 0) and b = (—1, 1, 0, 0). 
Averaging — 1 with and 1 with the other simultaneously in a and b immediately yields the global 
maximum directly by symmetrizing the saddle point, i.e. rather than finding it by running the EM 
algorithm repeatedly from various starting points. 

An affirmative answer to Question 6.1 would imply several things. It would yield a (positive) 
solution to the 100 Swiss francs problem, as discussed in Section 6.3. More generally, it would 
explain in a rather precise way how certain symmetries in data seem to impose symmetry on the 
global maxima of the maximum likelihood function. Moreover it would suggest good ways to look 
for global maxima, as well as constraining them enough that in some cases they can be charac- 
terized, as we demonstrate for the 100 Swiss francs problem. To make this concrete, one thing it 
would tell us for an n by n data matrix which is fixed by the 5„ action simultaneously permuting 
rows and columns in the same way, is that any probability matrix maximizing likelihood for such a 
data matrix will have at most two distinct types of rows. 

We do not know the answer to this question, but we do prove that this type of averaging will 
at least give a critical point within the subspace in which Cj , ej , fi , fj may vary freely but all other 
parameters are held fixed. Data also provides evidence that the answer to the question may very 
well be yes. At the very least, this type of averaging appears to be a good heuristic for seeking 
local maxima, or at least finding a way to continue to increase maximum likelihood beyond what 
it is at a critical point one reaches. Moreover, while real data is unlikely to have these symmetries, 
perhaps it could come close, and this could still be a good heuristic to use in conjunction with the 
EM algorithm. 

6.2 Preservation of Marginals and Some Consequences 

Proposition 6.3. Given data in which all row and column sums (i.e. marginals) are equal, then for 
M to maximize the likelihood function for this data among matrices of a fixed rank, row and column 
sums of M all must be equal. 
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We prove the case mentioned in the abstract, which should generalize by adjusting exponents 
and ratios in the proof. It may very well also generalize to distinct marginals and tables with more 
rows and columns. 

Proof. Let Ri, R2, R3, Ra be the row sums of M. Suppose Ri > R2 > R3 > Ra, other cases will be 
similar. Choose 6 so that i?3 = (1 + 5)Ra- We will show that multiplying row 4 by any 1 + e with 
< e < min(l/4, 5/2) will strictly increase L, giving a contradiction to M maximizing L. The result 
for column sums follows by symmetry. 

Let us write L{M') for the new matrix M' in terms of the variables Xij for the original matrix 
M, so as to show that L{M') > L{M). The first inequality below is proven in Lemma 6.4. 



Ri + R2 + R3 + {1 + e)i?4)^° 
[(1 + l/4(e - e2))(i?i + R2 + R3 + Ra)]^o 
[(1 + l/4(e - e2))4]i0[i?, + + i?3 + ii4]^° 

[1 + 4(l/4)(e - £2) + 6(1/4)2(6 - £2)2 + . . . + (i/4)4(e _ e^mEt=i ^*]^° 



□ 

Lemma 6.4. Ife < min(l/4, 5/2) and -Ri > i?2 > ^3 = (1 + 5)Ra, then iii + i?2 + i?3 + (1 + e)i?4 < 
(1 + l/4(e - e2))(i?i +R2+R^ + R^). 

Proof. It is equivalent to show ei?4 < (l/4)(e)(l — e) Yli=i ^i- However, 

4 

{l/4){e){l-e){Y,R^) > (3/4)(6)(l-6)(l + 5)i?4 + (l/4)(6)(l-6)i?4 

1=1 

> (3/4)(e)(l - e)(l + 2e)RA + (l/4)(e)(l - e)i?4 
= (3/4)(e)(l + e - 2e^)RA + (l/4)(e - e^)RA 

= 6i?4 + [(3/4)(e') - (6/4)(63)]ii4 - (l/4)(e')i?4 

= eR, + [{l/2){e')-{3/2){e')]R, 

> eR, + [{l/2)ie') - {3/2)ie\l/4)]R, 

> eR^. 



□ 

Corollary 6.5. There exist vectors (ei, 62, es, 64) and (/i, /2, /s, /a) such that XlLi = Z^Li = 
and Mi J = K + Cifj. Moreover, K equals the average entry size. 
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In particular, this tells us that L may be maximized by treating it as a function of just six 
variables, namely ei, 62, 63, /i, /2, /a, since 64, /4 are also determined by these; changing K before 
solving this maximization problem simply has the impact of multipl3dng the entire matrix M that 
maximizes likelihood by a scalar. 

Let E be the deviation matrix associated to M, where Eij = eifj. 

Question 6.6. Another natural question to ask, in light of this corollary, is whether the matrix of rank 
at most r maximizing L is expressible as the sum of a rank one matrix and a matrix of rank at most 
r — 1 that maximizes L among matrices of rank at most r — 1. 

Remarks When we consider matrices with fixed row and column sums, then we may ignore the 
denominator in the likelihood function and simply maximize the numerator. 

Corollary 6.7. If M which maximizes L has = ej, then it also has fi = fj. Consequently, if it has 
ei ^ ej, then it also has fi 7^ fj. 

Proof. One consequence of having equal row and column sums is that it allows the likelihood 
function to be split into a product of four functions, one for each row, or else one for each column; 
this is because the sum of all table entries equals the sum of those in any row or column multiplied 
by four, allowing the denominator to be written just using variables from any one row or column. 
Thus, once the vector e is chosen, we find the best possible / for this given e by solving four 
separate maximization problems, one for each fi, i.e. one for each column. Setting Cj = ej causes 
the likelihood function for column i to coincide with the likelihood function for column j, so both 
are maximized at the same value, implying fi = fj. □ 

Next we prove a slightly stronger general fact for matrices in which rows and columns i,j may 
simultaneously be swapped without changing the data matrix: 

Proposition 6.8. If a matrix M maximizing likelihood has Ci > Cj > 0, then it also has fi > fj > 0. 

Proof. Without loss of generality, say i = l,j = 3. We will show that if ei > 63 and fi < /a, then 
swapping columns one and three will increase likelihood, yielding a contradiction. Let 

Li(ei) = (1/4 + ei/i)"(l/4 + ei/2)2(l/4 + 61/3)2(1/4 + e.f^f 

and 

L3(e3) = (1/4 + 62/1)^(1/4 + 62/2)^(1/4 + 63/3)^(1/4 + 63/4)', 
namely the contributions of rows 1 and 3 to the likelihood function. Let 

ifi(ei) = (1/4 + 61/3)^(1/4 + 61/2)^(1/4 + ei/i)2(l/4 + 61/4)' 

and 

Ks{es) = (1/4 + 63/3)^(1/4 + 63/2)^(1/4 + 63/1)^(1/4 + 63/4)', 

so that after swapping the first and third columns, the new contribution to the likelihood function 
from rows one and three is 7^1(61)^3(63). Since the column swap does not impact that contribu- 
tions from rows 2 and 4, the point is to show i(ri(6i)iir3(63) > Li(6i)L3(63). Ignoring common 
factors, this reduces to showing 

(1/4 + 61/3)^(1/4 + 63/1)' > (1/4 + 6i/i)2(l/4 + 63/3)', 
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in other words 

(1/16 + l/4(ei/3 + ea/i) + 6163/1/3)' > (1/16 + l/4(ei/i + 63/3) + 6163/1/3)', 

namely ei/3 + 63/1 > ei/i + 63/3. But since 63 < ei, /i < /3, we have < (ei - e3)(/3 - /i) = 
(ei/3 + 63/1) - (ei/i + 63/3), just as needed. □ 

Question 6.9. Does having a data matrix which is symmetric with respect to transpose imply that 
matrices maximizing likelihood will also be symmetric with respect to transpose? 

Perhaps this could also be verified again by averaging, similarly to what we suggest for involu- 
tions swapping a pair of rows and columns simultaneously. 

6.3 The 100 Swiss Francs Problem 

We use the results derived to far so show how to reduce the 100 Swiss Francs problem to Question 
6.1. Thus, an affirmative answer to Question 6.1 would provide a mathematical proof formally 
that the three tables in 2 a) are global maxima of the log-likelihood function for the basic LC model 
with r = 2 and data given in (9) . 

Theorem 6.10. If the answer to Question 6.1 is yes, then the 100 Swiss francs problem is solved. 

Proof. Proposition 6.3 showed that for M to maximize L, M must have row and column sums 
which are all equal to the quantity which we call i2i, i?2, -^3, ^4, Ci, C2, C3, or C4 at our conve- 
nience. The denominator of L may therefore be expressed as (4Ci)^°(4C2)^*^(4(73)^''(4C4)^° or as 
(4i?i)^°(4i?2)^'^(4-R3)^°(4i?4)^°, enabling us to rewrite L as a product of four smaller functions using 
distinct sets of variables. 

Note that letting S4 simultaneously permute rows and columns will not change L, so let us 
assume the first two rows of M are linearly independent. Moreover, we may choose the first two 
rows in such a way that the next two rows are each nonnegative combinations of the first two. Since 
row and column sums are all equal, the third row, denoted vs, is expressible as xvi + (1 — x)v2 for 
vi,V2 the first and second rows and x G [0, 1]. One may check that M does not have any row or 
column with values all equal to each other, because if it had one, then it would have the other, 
reducing to a three by three problem which one may solve, and one may check that the answer 
does not have as high of likelihood as 

3 3 2 2 
3 3 2 2 
2 2 3 3" 
2 2 3 3 

Proposition 6.11 will show that if the answer to Question 6.1 is yes, then for M to maximize L, 
we must have x = or x = 1, implying row 3 equals either row 1 or row 2, and likewise row 4 
equals one of the first two rows. Proposition 6.12 shows M does not have three rows all equal 
to each other, and therefore must have two pairs of equal rows. Thus, the first column takes the 
form (a, a, b, b)^ , so it is simply a matter of optimizing a and b, then noting that the optimal choice 
will likewise optimize the other columns (by virtue of the way we broke L into a product of four 
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expressions which are essentially the same, one for each column). Thus, M takes the form 

a a h h 
a a b b 
b b a a 
b b a a 

since this matrix does indeed have rank two. Proposition 6.13 shows that to maximize L one needs 
2a = 36, finishing the proof. □ 

Proposition 6.11. If the answer to Question 6.1 is yes, then row 3 equals either row 1 or row 2 in 
any matrix M which maximizes likelihood. Similarly, each row i with i > 2 equals either row 1 or 
row 2. 

Proof. M3,3 = xMi,3 + (1 - x)M2,3 for some x e [0, 1], so Afs^a < max(Afi,3, Ms.a). If Mi,3 = M2,3, 
then all entries of this column are equal, and one may use calculus to eliminate this possibility 
as follows: either M has rank one, and then we may replace column three by (c, c, 2c, c)^ for 
suitable constant c to increase likelihood, since this only increases rank to at most two, or else the 
column space of M is spanned by (1, 1, 1, 1)-^ and some (ai, 02, 03, 04) with ^ = 0; specifically, 
column three equals (1/4, 1/4, 1/4, 1/4) + x(ai, 02, 03, 04) for some x, allowing its contribution to 
the likelihood function to be expressed as a function of x whose derivative at x = is nonzero, 
provided that 03 / 0, implying that adding or subtracting some small multiple of (ai, 02, 03, 04)^ 
to the column will make the likelihood increase. If = 0, then row three is also constant, i.e. 
63 = /s = 0. But then, an affirmative answer to the second part of Question 6.1 will imply that 
this matrix does not maximize likelihood. 

Suppose, on the other hand. Mi 3 > M2,3. Our goal then is to show 2; = 1. By Proposition 6.3 
applied to columns rather than rows, we know that (1, 1, 1, 1) is in the span of the rows, so each 
row may be written as 1/4(1, 1, 1, 1) + cv for some fixed vector v whose coordinates sum to 0. Say 
row 1 equals 1/4(1, 1, 1, 1) + kv for k = 1. Writing row three as 1/4(1, 1,1,1) + Iv, what remains is 
to rule out the possibility I < k. However, Proposition 6.8 shows that / < k and ai < 03 together 
imply that swapping columns one and three will yield a new matrix of the same rank with larger 
likelihood. 

Now we turn to the case of I < k and ai > 03. If ai = then swapping rows one and three 
will increase likelihood. Assume ai > 03. By Corollary 6.5, we have (61,62,63,64) with ei > 63 
and ifi, f2, fs, fi) with fi > f^. Therefore, if the answer to Question 6.1 is yes, then replacing 
61, 63 each by 5i±^ and /i, each by ^^^^ yields a matrix with larger likelihood, completing the 
proof. □ 

Proposition 6.12. In any matrix M maximizing L among rank 2 matrices, no three rows of M are 
equal to each other 

Proof. Without loss of generality, if M had three equal rows, then M would take the form 

a c e g 

b d f h 

b d f h 

b d f h 
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but then the fact that M maximizes L ensures d = f = h and c = e = g since L is a product of four 
expressions, one for each column, so that the second, third and fourth columns will all maximize 
their contribution to L in the same way. Since all row and column sums are equal, simple algebra 
may be used to show that all entries must be equal. However, we have already shown that such 
matrices do not maximize L. □ 

Proposition 6.13. To maximize M requires a, b related by 2a = Sb. 

Proof. We must maximize (g^^pfjpr- We may assume a + b = I since multiplying the entire matrix 
by a constant does not change L, so we maximize (l/8)^''a^6^ with 6 = 1 — a; in other words, 
we maximize /(a) = a^{l - a)'^. But solving /'(a) = = 60^(1 - a)'' + a^{A){l - af{-l) = 
a^{l - af[6{l - a) - 4a] yields 6(1 - a) - 4a = 0, so a = 6/10 and b = 4/10 as desired. □ 

7 Conclusions 

In this paper we have reconsidered the classical latent class model for contingency table data and 
studied its geometric and statistical properties. For the former we have exploited tools from alge- 
braic geometry and computation tools that have allowed us to display the complexities of the latent 
class model. We have focused on the problem of maximum likelihood estimation under LC models 
and have studied the singularities arising from symmetries in the contingency table data and the 
multiple maxima that appear to result from these. We have given an informal characterization of 
this problem, but a strict mathematical proof of the existence of identical multiple maxima has 
eluded us; we describe elements of a proof in a separate section. 

We have also appUed LC models to data arising in two real-life appUcations. In one, the model 
is quite simple and maximum likelihood estimation poses little problems, whereas in the other 
high-dimensional example various issues, computational as well as model-based, arise. From com- 
putational standpoint, both the EM and the Newton-Raphson algorithm are especially vulnerable 
to problems of multimodality and provide little in the way of clues regarding the dimensionality 
difficulties associated with the underlying structure of LC models. Furthermore, the seemingly sin- 
gular behavior of the Fisher information matrix at the MLE that we observe even for well-behaved, 
identifiable models is an additional element of complexity. 

Based on our work, we would advise practitioners to exercise caution in applying LC models, 
especially to sparse data. They have a tremendous heuristic appeal and in some examples provide a 
clear and convincing description of the data. But in many situations, the kind of complex behavior 
explored in this paper may lead to erroneous inferences. 
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